Simulate hierarchical clustering of the example data (manually). Be smart: you do not need a 20x20 distance matrix if you can visually determine the smallest distances. Provide the order of all mergers. Sometimes there are alternatives with the same distance; just decide which one to merge first. Draw a tree/dendrogram of the clustering and provide one ordering of the data points along that clustering tree (the order of leaves in the tree).
Solution:
Hierarchical (agglomerative) clustering works as follows: start with every object in its own cluster, then repeatedly merge the two closest clusters until only one remains.
The distance used in this assignment is the Euclidean distance. The graphical representation of the result is called a dendrogram, or tree, where the objects are joined together in a hierarchical fashion from the closest, that is most similar, to the furthest apart, that is the most different.
First we need to load the data points and calculate the distance between each pair using the function dist.
points <- data.frame(
    ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
           "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
    x = c(3, 4, 2.1, 4, 7, 3, 6.1, 7, 4, 3, 6.2, 7, 5, 3.5, 2.5, 3.5, 5.5, 6, 0.5, 0.8),
    y = c(6.1, 2, 5, 6, 3, 5, 4, 2, 1.5, 2, 2, 3, 5, 4.5, 6, 5.5, 4.5, 1, 1.5, 1.2))
d <- dist(as.matrix(points[, 2:3]))
The first step in the hierarchical clustering process is to look for the pair of samples that are most similar, that is, closest in the sense of having the lowest dissimilarity. This is the pair E and L, with dissimilarity equal to 0, because they are in fact the same point. These two samples are joined at level 0 in the first step of the dendrogram, or clustering tree; the point at which they are joined is called a node.
There are several ways to calculate the dissimilarity between the merged pair (E,L) and the other samples; this choice determines the type of hierarchical clustering we perform. First we use the maximum, or complete linkage, method: the dissimilarity between the merged pair and each other sample is the maximum of the two individual dissimilarities. Since E and L have the same dissimilarity to every other sample, this is better illustrated in the next iteration.
The process is now repeated: the smallest remaining dissimilarity is 0.42, for samples S and T, so we cluster them at this level and recompute the dissimilarities between the merged pair (S,T) and the rest of the samples, again taking the maximum of the pair of dissimilarities in each case. For example, the dissimilarity between S and A is 5.24, while the dissimilarity between T and A is 5.37; hence we choose the maximum of the two, 5.37, to quantify the dissimilarity between (S,T) and A. Continuing in this way we obtain a new dissimilarity matrix.
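These first merges are easy to check numerically. The sketch below rebuilds the distance matrix from the data (repeated here under fresh names, `pts` and `dmat`, so the snippet runs standalone) and verifies the values quoted in the text.

```r
# Verify the first complete-linkage merges numerically.
pts <- data.frame(
    ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
           "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
    x = c(3, 4, 2.1, 4, 7, 3, 6.1, 7, 4, 3, 6.2, 7, 5, 3.5, 2.5, 3.5, 5.5, 6, 0.5, 0.8),
    y = c(6.1, 2, 5, 6, 3, 5, 4, 2, 1.5, 2, 2, 3, 5, 4.5, 6, 5.5, 4.5, 1, 1.5, 1.2))
dmat <- as.matrix(dist(pts[, c("x", "y")]))
dimnames(dmat) <- list(pts$ID, pts$ID)

dmat["E", "L"]                                 # 0 -- E and L coincide, merged first
round(dmat["S", "T"], 2)                       # 0.42 -- next smallest dissimilarity
round(dmat["S", "A"], 2)                       # 5.24
round(dmat["T", "A"], 2)                       # 5.37
round(max(dmat["S", "A"], dmat["T", "A"]), 2)  # 5.37 -- complete-linkage d((S,T), A)
```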
Next we cluster B and I; I will not show every step, because the procedure simply repeats until all elements are clustered.
Because there are 20 objects to be clustered, there are 19 steps in the sequential process to arrive at the final tree where all objects are in a single cluster.
At the end we get the following tree:
When using minimal, or single, linkage the dissimilarity between the merged pair and the others is instead the minimum of the pair of dissimilarities in each case. For example, the dissimilarity between S and A is 5.24, while the dissimilarity between T and A is 5.37; hence we choose the minimum of the two, 5.24, to quantify the dissimilarity between (S,T) and A. Continuing in this way we obtain a new dissimilarity matrix.
After finishing all steps we get the following tree:
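Both trees can be reproduced with R's built-in hclust; the snippet below redefines the data (as `pts`/`dpts`) so it runs standalone. The merge heights and leaf orders should match the manual simulation, up to ties between equal distances.

```r
# Reproduce both dendrograms with hclust.
pts <- data.frame(
    ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
           "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
    x = c(3, 4, 2.1, 4, 7, 3, 6.1, 7, 4, 3, 6.2, 7, 5, 3.5, 2.5, 3.5, 5.5, 6, 0.5, 0.8),
    y = c(6.1, 2, 5, 6, 3, 5, 4, 2, 1.5, 2, 2, 3, 5, 4.5, 6, 5.5, 4.5, 1, 1.5, 1.2))
dpts <- dist(pts[, c("x", "y")])

hc_complete <- hclust(dpts, method = "complete")
hc_single <- hclust(dpts, method = "single")

par(mfrow = c(1, 2))
plot(hc_complete, labels = pts$ID, main = "Complete linkage")
plot(hc_single, labels = pts$ID, main = "Single linkage")

# One admissible ordering of the leaves for each tree:
pts$ID[hc_complete$order]
pts$ID[hc_single$order]
```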
Use the same data, simulate the K-means clustering starting from initial cluster centers (points) - A, C, F, M - as indicated in the file.
Solution:
K-means clustering is an unsupervised algorithm that works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity (in this case Euclidean distance). The algorithm consists of two steps:
1. Assign each data point to the cluster whose center is closest.
2. Recompute each cluster center as the mean of the data points assigned to it.
The algorithm iterates between steps one and two until a stopping criterion is met (e.g., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
rm(list = ls())
# Load the data
points <- data.frame(
    ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
           "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
    x = c(3, 4, 2.1, 4, 7, 3, 6.1, 7, 4, 3, 6.2, 7, 5, 3.5, 2.5, 3.5, 5.5, 6, 0.5, 0.8),
    y = c(6.1, 2, 5, 6, 3, 5, 4, 2, 1.5, 2, 2, 3, 5, 4.5, 6, 5.5, 4.5, 1, 1.5, 1.2))
# Center points A,C,F,M
start <- rbind(c(3, 6.1), c(2.1, 5), c(3, 5), c(5, 5))
# Calculate the Euclidean distance between the points and the center points
myEuclid <- function(data, centers) {
    distanceMatrix <- matrix(NA, nrow = nrow(data), ncol = nrow(centers))
    for (i in 1:nrow(centers)) {
        distanceMatrix[, i] <- sqrt(rowSums(t(t(data) - centers[i, ])^2))
    }
    distanceMatrix
}
# K-means algorithm: alternate the assignment and update steps for nItter iterations
myKmeans <- function(x, centers, distFun, nItter = 3) {
    clusterHistory <- vector(nItter, mode = "list")
    centerHistory <- vector(nItter, mode = "list")
    for (i in 1:nItter) {
        distsToCenters <- distFun(x, centers)
        # Step 1: assign every point to its nearest center
        clusters <- apply(distsToCenters, 1, which.min)
        # Step 2: recompute each center as the mean of its assigned points
        centers <- apply(x, 2, tapply, clusters, mean)
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }
    list(clusters = clusterHistory, centers = centerHistory)
}
k <- myKmeans(points[, 2:3], start, myEuclid, nItter = 3)
k
## $clusters
## $clusters[[1]]
## [1] 1 3 2 1 4 3 4 4 3 3 4 4 4 3 1 3 4 4 2 2
##
## $clusters[[2]]
## [1] 1 3 1 1 4 1 4 4 3 3 4 4 1 3 1 1 4 4 2 2
##
## $clusters[[3]]
## [1] 1 3 1 1 4 1 4 4 3 3 4 4 1 1 1 1 4 4 2 2
##
##
## $centers
## $centers[[1]]
## x y
## 1 3.166667 6.033333
## 2 1.133333 2.566667
## 3 3.500000 3.416667
## 4 6.225000 3.062500
##
## $centers[[2]]
## x y
## 1 3.300 5.514286
## 2 0.650 1.350000
## 3 3.625 2.500000
## 4 6.400 2.785714
##
## $centers[[3]]
## x y
## 1 3.325000 5.387500
## 2 0.650000 1.350000
## 3 3.666667 1.833333
## 4 6.400000 2.785714
The following plot shows how the centers change with each iteration.
par(mfrow = c(2, 2))
for (i in 1:3) {
    plot(points[, 2:3], col = k$clusters[[i]], main = paste("iteration:", i),
        xlab = "x", ylab = "y", ylim = c(0, 7))
    points(k$centers[[i]], cex = 3, pch = 19, col = 1:nrow(k$centers[[i]]))
}
We can check the result using the function kmeans in R.
k <- kmeans(points[, 2:3], start)
k$centers
## x y
## 1 3.325000 5.387500
## 2 0.650000 1.350000
## 3 3.666667 1.833333
## 4 6.400000 2.785714
plot(points[, 2:3], col = k$cluster, pch = 3, ylim = c(1, 7))
points(k[["centers"]], pch = 4, cex = 2, col = "blue")
text(points$x, points$y, labels = points$ID, cex = 1, pos = 3)
Use the large data set for clustering with some existing software or library. The file is an RGB version of a row-shuffled image. http://biit.cs.ut.ee/imgshuffle/data/DM2017/DM2017.txt
The tool to re-render the image in a different row order is here: http://biit.cs.ut.ee/imgshuffle/index.cgi?fname=DM2017&dname=DM2017&shufflefile=DM2017.txt
The tool above allows you to paste in the row IDs (R0001, R0002, etc.) in a different permutation order; it will render the same image according to the new order of rows. Your task is to cluster the data and fetch the row IDs in some order provided by the clustering.
Try out clustering the data using hierarchical clustering or K-means, for example.
What is depicted on the image? Insert a version of the “recovered image” into the report. Of course, it does not need to be perfect; it is merely meant to visualize what you get from the analysis.
Solution:
Using hierarchical clustering we can group the image’s rows into clusters based on their RGB values. The idea is that rows with similar colors will be grouped together, and we can use this information to unshuffle the image. For this exercise I use both single and complete linkage.
# Read the text file (column names already deleted)
image <- read.csv("C:/Users/Arsovska/Desktop/DM2017.txt", sep = " ",
    header = FALSE, comment.char = "r")
image <- image[-1]
# Save the original row ordering as row names
rownames(image) <- seq_len(nrow(image))
library(seriation)
# Calculate the distance matrix between rows
d <- dist(image)
# Hierarchical clustering - min distance (single linkage)
hc_s <- hclust(d, method = "single")
get_order(hc_s)
result1 <- as.numeric(rownames(image)[get_order(hc_s)])
# Transform the row numbers into the R0001-style IDs expected by the tool
uns_rows1 <- sprintf("R%04d", result1)
write.table(uns_rows1, "C:/Users/Arsovska/Desktop/seriation1.txt", row.names = F,
    col.names = F, quote = F)
# Hierarchical clustering - max distance (complete linkage)
hc_c <- hclust(d, method = "complete")
get_order(hc_c)
result2 <- as.numeric(rownames(image)[get_order(hc_c)])
uns_rows2 <- sprintf("R%04d", result2)
write.table(uns_rows2, "C:/Users/Arsovska/Desktop/seriation2.txt", row.names = F,
    col.names = F, quote = F)
Hierarchical clustering is not primarily meant for image processing, so the reconstructed images are not perfect.
We can get far better results if, for example, we use a travelling-salesman (TSP) seriation with the Manhattan distance instead.
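As a sketch of that alternative, the snippet below applies the seriation package's "TSP" method with Manhattan distances to a synthetic “shuffled image” (a smooth gradient with its rows permuted) rather than the real data file, so it runs standalone; the R0001-style row names here are hypothetical.

```r
library(seriation)

# Toy "shuffled image": 30 rows whose true order forms a smooth gradient.
set.seed(1)
gradient <- outer(seq_len(30) / 30, rep(1, 5))   # row i is constant at i/30
shuffle <- sample(nrow(gradient))
shuffled <- gradient[shuffle, ]
rownames(shuffled) <- sprintf("R%04d", shuffle)  # hypothetical R0001-style IDs

# Order the rows along a travelling-salesman path over Manhattan distances.
o <- seriate(dist(shuffled, method = "manhattan"), method = "TSP")
rownames(shuffled)[get_order(o)]                 # IDs in (approximately) smooth order
```

For the real data, the same seriate(dist(image, method = "manhattan"), method = "TSP") call on the loaded rows would yield the ID ordering to paste into the tool.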
Read the analysis “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance” http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Solution:
The article covers NYC transit data from taxis, Uber, and Citi Bike. The analysis tools used were PostgreSQL, PostGIS, and R. In another article the author explained some of the problems he encountered with the data: some files contained empty lines and unquoted carriage returns, had extra columns, and used different data formats for the same cab type. As a result, data cleaning needed to be performed before analyzing the data. The author presents various plots and maps showing how cab rides and travel times depend on different factors, and also explains how to work with big data more efficiently. The plots were created with R’s ggplot2 and represent the different analyses he made; for example, the number of pickups for different cab types in different New York neighborhoods. The author also calculated the distribution of how long it took to travel from each neighborhood to the airports at each hour of the day, presenting some median travel-time results, and uses histograms for travel-time and drop-off analyses. Using the Central Park weather data, the author shows some relationships between weather conditions and Uber trips.
This data set can serve many business needs. For example, the data can be used to determine how to position drivers in order to maximize their earnings, and whether it is better to remain stationary between trips, return to high-demand hotspots after each trip, or drive around randomly between trips. Also, based on past trends, the data can be used to predict the company’s future growth. Companies and night clubs can identify clients or employees who use cab services often and form valuable partnerships with them. Travel times can be used to find optimal routes in different time periods, and drivers can be shown how their driving patterns compare to other drivers in their city, with suggestions on how to provide a smoother, safer ride.